library(readr)
library(lattice)
library(tidyverse)
library(rticles)
library(readxl)
library(dplyr)
library(kableExtra)
library(leaflet)
library(ggplot2)
library(reticulate)
library(skimr)

Introduction
Airbnb is a rental platform that allows home-owner to list their properties(host) and guest can pay to stay in them for a duration of time. The price of these listings are set by the host themselves. The pricing of these properties is very important to get right, especially in the places with high competition. Even few dollars per night could make a difference. If the host put a price too high then there is a risk of missing potential guests and on the other hand, if they price too low then they might be missing the potential income. So it is a balancing act of which price to fix according to the different features of the rental properties.

But which properties should the host focus on while deciding the pricing? Should they fix their price based on the number of bedrooms, or number of bathrooms or a combination of both? The goal of this project is try to solve this problem. We want to study which features of the rental properties are significant in predicting the log_price and which are not. Furthermore we will be using the Bayesian methods for this.

train <- read_csv("train.csv")
skim(train)
Data summary
Name train
Number of rows 74111
Number of columns 29
_______________________
Column type frequency:
character 15
logical 4
numeric 10
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
property_type 0 1.00 3 18 0 35 0
room_type 0 1.00 11 15 0 3 0
amenities 0 1.00 2 1496 0 67122 0
bed_type 0 1.00 5 13 0 5 0
cancellation_policy 0 1.00 6 15 0 5 0
city 0 1.00 2 7 0 6 0
Description 0 1.00 1 1000 0 73469 0
first_review 15864 0.79 6 8 0 2554 0
host_response_rate 18299 0.75 2 4 0 80 0
host_since 188 1.00 6 8 0 3087 0
last_review 15827 0.79 6 8 0 1371 0
name 0 1.00 1 179 0 73331 0
neighbourhood 6872 0.91 4 35 0 619 0
thumbnail_url 8216 0.89 79 93 0 65883 0
zipcode 968 0.99 2 31 0 668 0

Variable type: logical

skim_variable n_missing complete_rate mean count
cleaning_fee 0 1 0.73 TRU: 54403, FAL: 19708
host_has_profile_pic 188 1 1.00 TRU: 73697, FAL: 226
host_identity_verified 188 1 0.67 TRU: 49748, FAL: 24175
instant_bookable 0 1 0.26 FAL: 54660, TRU: 19451

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
id 0 1.00 11266617.10 6081734.89 344.00 6261964.50 12254147.00 16402260.50 21230903.00 ▅▅▅▇▆
log_price 0 1.00 4.78 0.72 0.00 4.32 4.71 5.22 7.60 ▁▁▆▇▁
accommodates 0 1.00 3.16 2.15 1.00 2.00 2.00 4.00 16.00 ▇▁▁▁▁
bathrooms 200 1.00 1.24 0.58 0.00 1.00 1.00 1.00 8.00 ▇▂▁▁▁
latitude 0 1.00 38.45 3.08 33.34 34.13 40.66 40.75 42.39 ▅▁▁▁▇
longitude 0 1.00 -92.40 21.71 -122.51 -118.34 -77.00 -73.95 -70.99 ▆▁▁▁▇
number_of_reviews 0 1.00 20.90 37.83 0.00 1.00 6.00 23.00 605.00 ▇▁▁▁▁
review_scores_rating 16722 0.77 94.07 7.84 20.00 92.00 96.00 100.00 100.00 ▁▁▁▁▇
bedrooms 91 1.00 1.27 0.85 0.00 1.00 1.00 1.00 10.00 ▇▁▁▁▁
beds 131 1.00 1.71 1.25 0.00 1.00 1.00 2.00 18.00 ▇▁▁▁▁

GENERAL OVERVIEW OF THE DATA

We use the Skimr library to perform Skim to get a general overview of the data. From the above information we can see that there are 74,111 rows in our data and 29 colums/variables.

Among these 29 variables in our data 15 data are character type, 4 are logical and 10 are numerical type.

We can also see that there is a lot of missing datas. However, the variable with most frequency of missing data, and the ones that are most significant for our study are:

First review 15864 host_response_rate 18299 review_scores_rating 16722

In the quantitative variable the variance is greater than 0 for all our variables.

There are also some index variable which are listed as character variables that do not contribute to our study. These index variables need to be addressed before we move to our study.

When we look at the logical variable there are also factor level that appears tobe infrequent. The most significant infrequent factor level is stated below:

host_has_profile_pic TRU: 73697, FAL: 226

These character variables listed below has low number of n_unique data types so we further study these variables to see if they need to be converted to factor to be included in the study:

property_type bed_type
cancellation_policy
city
Description room_type

Dropping the missing values, as we had missing dates which is hard to impute, other missing values were less than 30%, so we didn’t impute them or use any other missing value handling techniques.

train <- drop_na(train)
sapply(train, function(x) length(unique(x)))
##                     id              log_price          property_type 
##                  38502                    608                     31 
##              room_type              amenities           accommodates 
##                      3                  36204                     16 
##              bathrooms               bed_type    cancellation_policy 
##                     17                      5                      5 
##           cleaning_fee                   city            Description 
##                      2                      6                  38190 
##           first_review   host_has_profile_pic host_identity_verified 
##                   2448                      2                      2 
##     host_response_rate             host_since       instant_bookable 
##                     76                   2981                      2 
##            last_review               latitude              longitude 
##                    984                  38489                  38468 
##                   name          neighbourhood      number_of_reviews 
##                  38279                    586                    365 
##   review_scores_rating          thumbnail_url                zipcode 
##                     52                  38496                    571 
##               bedrooms                   beds 
##                     11                     18

In our data there are we see 29 different variable in the given dataset.

# Layout to split the screen
layout(mat = matrix(c(1,2),2,1, byrow=TRUE),  height = c(1,8))
 
# Draw the boxplot and the histogram 
par(mar=c(0, 3.1, 1.1, 2.1))
boxplot(train$log_price , horizontal=TRUE , ylim=c(0,8), xaxt="n" , col=rgb(0.8,0.8,0,0.5) , frame=F)
par(mar=c(4, 3.1, 1.1, 2.1))
hist(train$log_price , breaks=40 , col=rgb(0.2,0.8,0.5,0.5) , border=F , main="" , xlab="log_price", xlim=c(-0,8))

We can see that log_price is almost symmetric around 5. However we can see there are a lots of outliers in the variable.

Now after the study of all the data in a surface level we further move to and in-depth study of significant data to get more understanding of the variables:

Geographical analysis

Let us look at how our data is distributed across different cities.

m <- leaflet(train) %>%
addTiles() %>%
addProviderTiles("OpenStreetMap.BZH") %>%
addCircleMarkers(~longitude, ~latitude, color = "red",
stroke=FALSE)
m

Here we can see that our data contains rental listing for 6 different cities: Los Angeles, New York, DC, Boston, Chicago, San Fransico etc. Now let us look at the relationship between the predictor variables and the response variable. We will first examine graphically.

First let us loook at the log price for NYC.

maindata=train
rows = (maindata$city== "NYC") 
tmp = maindata[rows, ]

mybins <- seq(2, 8, by=1.5)
mypalette <- colorBin( palette="YlOrBr",
domain=tmp$log_price,
na.color="transparent",
bins=mybins)

m <- leaflet(tmp) %>%
addTiles() %>%
addProviderTiles("OpenStreetMap.BZH") %>%
addCircleMarkers(~longitude, ~latitude,radius = ~log_price,
fillColor = ~mypalette(log_price),
fillOpacity = 0.5,
color = "white",stroke=FALSE
)%>%
addLegend( pal=mypalette, values=~log_price, opacity=0.9,
title = "Log_price", position = "bottomright" )
m

We can see that among all the points, the top left part of the points seems have a higher log_price. That is the main NYC and Manhattan area.

Now let us look at Boston

rows = (maindata$city== "Boston") 
tmp = maindata[rows, ]

mybins <- seq(2, 8, by=1.5)
mypalette <- colorBin( palette="YlOrBr",
domain=tmp$log_price,
na.color="transparent",
bins=mybins)

m <- leaflet(tmp) %>%
addTiles() %>%
addProviderTiles("OpenStreetMap.BZH") %>%
addCircleMarkers(~longitude, ~latitude,radius = ~log_price,
fillColor = ~mypalette(log_price),
fillOpacity = 0.5,
color = "white",stroke=FALSE
)%>%
addLegend( pal=mypalette, values=~log_price, opacity=0.9,
title = "Log_price", position = "bottomright" )
m

We can see that top right part of the data points have the higer log_price compared to other parts of the city.

Now we look at DC

rows = (maindata$city== "DC") 
tmp = maindata[rows, ]

mybins <- seq(2, 8, by=1.5)
mypalette <- colorBin( palette="YlOrBr",
domain=tmp$log_price,
na.color="transparent",
bins=mybins)

m <- leaflet(tmp) %>%
addTiles() %>%
addProviderTiles("OpenStreetMap.BZH") %>%
addCircleMarkers(~longitude, ~latitude,radius = ~log_price,
fillColor = ~mypalette(log_price),
fillOpacity = 0.5,
color = "white",stroke=FALSE
)%>%
addLegend( pal=mypalette, values=~log_price, opacity=0.9,
title = "Log_price", position = "bottomright" )
m

The parts like Capitol Hill and Dupont Cicle have relatively higher log_price compared to other parts.

Now for San Francisco.

rows = (maindata$city== "SF") 
tmp = maindata[rows, ]

mybins <- seq(2, 8, by=1.5)
mypalette <- colorBin( palette="YlOrBr",
domain=tmp$log_price,
na.color="transparent",
bins=mybins)

m <- leaflet(tmp) %>%
addTiles() %>%
addProviderTiles("OpenStreetMap.BZH") %>%
addCircleMarkers(~longitude, ~latitude,radius = ~log_price,
fillColor = ~mypalette(log_price),
fillOpacity = 0.5,
color = "white",stroke=FALSE
)%>%
addLegend( pal=mypalette, values=~log_price, opacity=0.9,
title = "Log_price", position = "bottomright" )
m

San Fransico has relatively uniform distribution for log_price comapred to other cities. Howerver the top part of the data points seems to have relatively higher log_price.

Now for LA.

rows = (maindata$city== "LA") 
tmp = maindata[rows, ]

mybins <- seq(2, 8, by=1.5)
mypalette <- colorBin( palette="YlOrBr",
domain=tmp$log_price,
na.color="transparent",
bins=mybins)

m <- leaflet(tmp) %>%
addTiles() %>%
addProviderTiles("OpenStreetMap.BZH") %>%
addCircleMarkers(~longitude, ~latitude,radius = ~log_price,
fillColor = ~mypalette(log_price),
fillOpacity = 0.5,
color = "white",stroke=FALSE
)%>%
addLegend( pal=mypalette, values=~log_price, opacity=0.9,
title = "Log_price", position = "bottomright" )
m

We can see that the bay area has relatively higher log_price in LA.

Finally, let us look at Chicago.

rows = (maindata$city== "Chicago") 
tmp = maindata[rows, ]

mybins <- seq(2, 8, by=1.5)
mypalette <- colorBin( palette="YlOrBr",
domain=tmp$log_price,
na.color="transparent",
bins=mybins)

m <- leaflet(tmp) %>%
addTiles() %>%
addProviderTiles("OpenStreetMap.BZH") %>%
addCircleMarkers(~longitude, ~latitude,radius = ~log_price,
fillColor = ~mypalette(log_price),
fillOpacity = 0.5,
color = "white",stroke=FALSE
)%>%
addLegend( pal=mypalette, values=~log_price, opacity=0.9,
title = "Log_price", position = "bottomright" )
m

We can see that some of the mid portion and the edge points have relativerly higer log_price in Chicago. Overall, we can see that evern within the different cities there are certian parts that have higher log_price compared to other parts. Therefore, cities can also be a predictor of log_price.

Let us compare all the cities together.

#All the 5 different cancellation_policy were studied to see their relationship with log_price through the use of box plot

p <- ggplot(train, aes(x= city, y=log_price))+
  geom_boxplot(outlier.colour="black", outlier.shape=1,
                outlier.size=1)+
  ggtitle("Boxplot for log_price vs city") 
p 

We can see that Boston and SF has the relative higher average means compared to other cities.

Numeric Variables

Now we will examine the correlation between the numberic variables and the log_price. We will do scatter plot for this.

train %>% filter(review_scores_rating > 60)->tmp
par(mfrow=c(2,2))
ggplot(data=train, aes(x=beds, log_price)) + geom_point(col="#69b3a2", size=0.5) ->beds
ggplot(data=train, aes(x=bathrooms, log_price)) + geom_point(col="#69b3a2", size=0.5) ->bathroom
ggplot(data=train, aes(x=bedrooms, log_price)) + geom_point(col="#69b3a2", size=0.5) ->bedroom
ggplot(data=train, aes(x=accommodates, log_price)) + geom_point(col="#69b3a2", size=0.5) -> acc

ggpubr::ggarrange(beds, bathroom, bedroom ,acc,
          ncol = 2, nrow = 2)

We can see that overall the log_price increase as the value for these numeric variable increases. This matches with our general intuition that the price of the propertes depends upon the number of different characterstics. For example, as the number of beds increases, the log_price increses and so does for the number of bathrooms, bedrooms and accommodates.

ggplot(data=train, aes(x=host_response_rate, log_price)) + geom_point(col="#69b3a2", size=0.5) ->response
ggplot(data=train, aes(x=number_of_reviews, log_price)) + geom_point(col="#69b3a2", size=0.5) -> review
ggplot(data=train, aes(x=review_scores_rating, log_price)) + geom_point(col="#69b3a2", size=0.5) -> score

ggpubr::ggarrange(response, review ,score,
          ncol = 2, nrow = 2)

Let’s now look into different categorical variables in the dataset

1)Cancellation policy

We further look at Cancellation policy to investigate it further for our data study:

Despite in the initial study the cancellation policy appeared to be character when we have a deeper look we find that cancellation policy is instead a factor that needs to be accounted for. WE convert the cancellation policy to factor so we can look at the count to get an understanding of the distribution of this variable. We see that flexible, moderate and strict has the majority of frequency where as super_strict_30 and super_strict_60 has less significant frequency.

##The distinct cancellation_policy are recorded
train %>% group_by(cancellation_policy) %>% select(cancellation_policy) %>% unique()->cancellation_policy_grouped
cancellation_policy_grouped
train %>% mutate_at(vars(cancellation_policy),as.factor)%>% count(train$cancellation_policy)

We learn from the above display that cancellation_policy has 5 types. The types have been listed above

We now create a box plot to study the cancellation policy against log_price since the variables under study are quantitative variables against qualitative variables:

##All the 5 different cancellation_policy were studied to see their relationship with log_price through the use of box plot

p <- ggplot(train, aes(x= cancellation_policy, y=log_price))+
  geom_boxplot(outlier.colour="black", outlier.shape=1,
                outlier.size=1)+
  ggtitle("Boxplot for log_price vs cancellation_policy") 
p + theme_minimal()

Shape:

From the box plot we can see that super_strict_60 is skewed to the left.super strict is also skewed to the left but only slightly.Moderate seems to be skewed slightly to the right while flexible look to be normal distribution.

Center:

From the above boxplot we see that median of log_price for super_strict_60 seems to be the highest compared to all cancellation_policy. It is then followed by super_strict_strict_30, strict, moderate, flexible. There seems to be a pattern that as the cancellation policy become flexible the log_price median seems to decrease.

Range:

Super_strict_60 has the highest IQR compare to other cancellation_policy while super_strict_30 has the lowest IQR.

Unusual:

There are many outliers for flexible, moderate and strict which is reasonable as they are the types with most frequency.

2) Property_type

We now move to Property_type variable to investigate this variable:

Similar to cancellation_policy we find propety type also need to be converted to factor and further study the variable and the frequency.

##Only 10 different distinct property_types are recorded
train %>% group_by(property_type) %>% select(property_type) %>% unique()
count(train, property_type) %>% mutate(relative_freq = (n/sum(n)))
train %>% group_by(property_type) %>% select(property_type, log_price) %>%filter(property_type=="Apartment" | property_type =="Bed & Breakfast" | property_type =="Bungalow" | property_type == "Condominium" | property_type == "House" | property_type == "Loft" | property_type == "Townhouse") -> modified_property_type
count(modified_property_type, property_type) 

The property type variable and the frequency of the most significant types are listed above in the table. We see that the Apartment type is the most frequent type followed by House.

We now create a box plot to study the property type against log_price since the variables under study are quantitative variables against qualitative variables:

##Those selected property_types were studied through the use of boxplot to see their relationship with log_price 
ggplot(modified_property_type, aes(x= property_type, y=log_price))+geom_boxplot()+ ggtitle("Boxplot for log_price vs property_type")

Shape:

House and township are skewed to the right. Bed & Breakfast is slightly skewed to the left. The other type has normal distribution.

Center: Above we have log_price against property_type. We see that property_type does not have a strong effect on the log_price as median log_price for all property_type fall around the same area. The median log_price of condominium has the highest median among other property types.

Spread:

House has the highest IQR however while Bungalow has the lowest IQR.

Unusual:

There are most outliers for apartment which is also reasonable as it has the highest frequency.

3) Room Type

Now we look at Room Type variable to study it in more depth:

##Three different room_types were studied and each of their relative frequency is calculated to find their individual share
train %>% group_by(room_type) %>% select(room_type) %>% unique()
count(train, room_type) %>% mutate(relative_freq = (n/sum(n)))

There are 3 types of room types namely Entire home/apt, Private room and Shared room. The frequency and relative frequency have been listed above. Shared room has the lowest relative frequency while Entire home/apt has the most relative frequency.

We now draw the box plot foor room_type vs log_price to look at the shape, center, outliers and spread of the data since the variables under study are quantitative variables against qualitative variables:

##Those 3 kinds of room_types were studied through the use of boxplot to see their relationship with log_price 
ggplot(train, aes(x= room_type, y=log_price))+geom_boxplot()+ ggtitle("Boxplot for log_price vs room_type")

Shape:

Entire home/apt, private room and Shared room all have close to normal distribution.

Center: The median log_price is highest for Entire home/apt followed by private room and lowest for Shared room.

Spread:

The IQR for shared room is slightly more compared to Entire home/apt. Private has the lowest IQR.

Unusual:

There seems to be more outliers for Private room.

4) Cleaning fee

We will now investigate cleaning fee against log_price to further understand the data. For this we will plot a box plot as there is a qualitative and a quantitative variable:

##Those two conditions of cleaning fee(i.e. either yes or no) were studied through the use of boxplot to see their relationship with log_price 

ggplot(train, aes(x= cleaning_fee, y=log_price))+geom_boxplot()+ ggtitle("Boxplot for log_price vs cleaning_fee")

Shape:

Both cleaning fee false and true has a close to normal distribution.

Center: The median log_price is highest for cleaning_fee true than cleaning_fee false but it is very slight.

Spread:

The IQR for both cleaning fee false and true are very much comparable and looks equal.

Unusual:

There seems to be outliers for both cleaning fee false and true.

Assosiation between the predictor variables

Here, we try to investigate the relation/association between the predictor variables in our dataset to get the understanding of the predictor variables. We are also interested to see which predictor variables have similar effect.

room_type vs instant_bookable

Initially we investigate room_type and instant_bookable.Since both the variables under study are both qualitative variables we used mosaic plot to investigate the association.

library(vcd)
## Loading required package: grid
mosaic(~room_type + instant_bookable, data = train, shade = TRUE)

From above mosaic plot we see, the two-tiles shaped deep blue correspond to only one cell having shared room_type and true instant_bookable whose residuals are greater than +4 indicating much greater frequency in this cell than would be found if the hypothesis of independence. The two-tile shaped having shared room_type with false instant_bookable and entire home/apt room_type with true instant_bookable corresponds to the residuals between -2.0 and -3.7 which indicates the combination is extremely rare under the hypothesis of independence.

city vs cleaning_fee

Now, we investigate the association between cleaning_fee and city using mosaic plot.

library(vcd)
mosaic(~ city+cleaning_fee , data = train, shade = TRUE)

Based on above mosaic plot property with cleaning fee true in NYC and LA has the highest relative frequency.

We can also see that there are more property with cleaning fee false in NYC than we would have expected if the variables were independent and there are less property with cleaning fee True in NYC than we would have expected if the variables were independent.

However, there are fewer property with cleaning fee false in LA than we would have expected if the variables were independent and there are more property with cleaning fee true in LA than we would have expected if the variables were independent.

Hence, we can say that there seems to be association between two variables.

room_type vs cleaning_fee

We will now investigate the room_type and cleaning_fee similarly using the mosaic plot:

##similar case like above
mosaic(~ room_type+cleaning_fee, data = train, shade = TRUE)

Based on above mosaic plot Entire home/apt with cleaning fee true has the highest relative frequency.

We can also see that there are more Entire home/apt with cleaning fee true than we would have expected if the variables were independent and there are less Entire home/apt with cleaning fee false than we would have expected if the variables were independent.

However, there are fewer private room with cleaning fee true than we would have expected if the variables were independent and there are more private room with cleaning fee false than we would have expected if the variables were independent.

Finally,there are fewer shared room with cleaning fee true than we would have expected if the variables were independent and there are more shared room with cleaning fee false than we would have expected if the variables were independent.

Hence, we can say that there seems to be association between room_type and cleaning_fee.

room_type Vs city

Moving on to investigating room_type with city using mosaic plot:

#similar case like above
mosaic(~city + room_type, data = train, shade = TRUE)

Based on above mosaic plot Entire home/apt in NYC has the highest relative frequency followed by Entire home/apt in LA.

We can also see that there are less Entire home/apt and shared room in NYC than we would have expected if the variables were independent and there are more Private in NYC than we would have expected if the variables were independent.

However, there are fewer Private room in LA than we would have expected if the variables were independent and there are more Entire home/apt and shared room in LA than we would have expected if the variables were independent.

Hence, we can say that there seems to be association between room_type and city.

instant_bookable vs cleaning_fee

We will now used the fourfold plots to study the relationship between instant bookable and cleaning fee.

##these variables are highly significant
library(vcd)
fourfold(xtabs(~ instant_bookable + cleaning_fee, data = train))

There is no difference between in frequency of cleaning_fee being true or false when compared with value of instant_bookable because the confidence interval exactly bands overlap in the plot.

Hence, the isn’t any proof of association.

Bedrooms

train %>% group_by(bedrooms) %>% select(bedrooms) %>% unique()
count(train, bedrooms) %>% mutate(relative_freq = (n/sum(n)))
hist(train$bedrooms)

From the above histogram for bedrooms we can see that there is highest number of single bedroom among all property enlisted and some 2 bedrooms property. There is rarely any more than 4 bedrooms which makes sense as it is unusual to have more than 4 bedrooms,

Conclusion
We can observe from our analysis that categorical variables such as room type, cleaning fee and also geolocation are important features for log_price. Overall we can conclude that categorical variables are more important feature that impacts the log_price.